Biostat 212a Homework 2

Due Feb 8, 2025 @ 11:59PM

Author

Jiaye Tian (UID: 306541095)

Published

February 8, 2025

1 ISL Exercise 4.8.1 (10pts)

2 ISL Exercise 4.8.6 (10pts)

3 ISL Exercise 4.8.9 (10pts)

library(dplyr)
library(ggplot2)

data.frame(prob = seq(0, 0.99, 0.01)) %>%
  mutate(odds = prob / (1 - prob)) %>%
  ggplot(aes(x = prob, y = odds)) + 
  geom_point() + 
  geom_line() + 
  geom_vline(xintercept = 0.5, col = "red") +
  geom_hline(yintercept = 1, col = "red") + 
  coord_cartesian(ylim = c(0, 20)) + 
  labs(x = "p", 
       y = "Odds: p / (1 - p)", 
       title = "Odds vs Probability Relationship")
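The plot above shows the mapping between probability and odds. As a quick sanity check, the conversion in each direction can be sketched with two small helper functions (the function names are illustrative):

```r
# Convert a probability to odds, and odds back to a probability
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(odds) odds / (1 + odds)

prob_to_odds(0.5)   # p = 0.5 corresponds to odds of 1 (the red lines above)
odds_to_prob(1)     # and odds of 1 map back to p = 0.5
```

Note that the two functions are inverses of each other, which is why the curve above passes exactly through the point (0.5, 1).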

4 ISL Exercise 4.8.13 (a)-(i) (50pts)

library(ISLR2)

pairs(Weekly[, -9])

cor(subset(Weekly, select = -Direction))
              Year         Lag1        Lag2        Lag3         Lag4
Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
               Lag5      Volume        Today
Year   -0.030519101  0.84194162 -0.032459894
Lag1   -0.008183096 -0.06495131 -0.075031842
Lag2   -0.072499482 -0.08551314  0.059166717
Lag3    0.060657175 -0.06928771 -0.071243639
Lag4   -0.075675027 -0.06107462 -0.007825873
Lag5    1.000000000 -0.05851741  0.011012698
Volume -0.058517414  1.00000000 -0.033077783
Today   0.011012698 -0.03307778  1.000000000

Apart from the strong positive correlation between Year and Volume (0.84), there are no apparent relationships among the lag variables.

Weekly$Week <- 1:nrow(Weekly)

year_breaks <- Weekly %>%
  group_by(Year) %>%
  summarize(Week = min(Week))

ggplot(Weekly, aes(x = Week, y = Volume)) + 
  geom_line() + 
  geom_smooth() + 
  scale_x_continuous(breaks = year_breaks$Week, 
                     minor_breaks = NULL, 
                     labels = year_breaks$Year) + 
  labs(title = "Average Weekly Shares Traded vs. Time", 
       x = "Time") + 
  theme_light()

Equity trading volume rose substantially from the early 1990s onward, peaking around 2009 and beginning to decline in 2010.

ggplot(Weekly, aes(x = Year, fill = Direction)) + 
  geom_bar(position = "fill") +
  geom_hline(yintercept = 0.5, col = "red") +
  scale_x_continuous(breaks = seq(1990, 2010), minor_breaks = NULL) +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_light() + 
  theme(axis.title.y = element_blank(), 
        legend.position = "bottom") + 
  ggtitle("Up/Down Week % vs. Time")

Here is the direction over time. In only four years (2000, 2001, 2002, and 2008) did fewer than 50% of the weeks have a positive return.

The table below shows the breakdown of down weeks versus up weeks. Simply predicting an up week for the S&P 500 every time yields a baseline classifier with an accuracy of 55.56%.

prop.table(table(Weekly$Direction))

     Down        Up 
0.4444444 0.5555556 
ggplot(Weekly, aes(x = Week, y = Today / 100)) + 
  geom_line() + 
  scale_x_continuous(breaks = year_breaks$Week, minor_breaks = NULL, labels = year_breaks$Year) + 
  scale_y_continuous(labels = scales::percent_format(), breaks = seq(-0.2, 0.2, 0.05)) + 
  geom_hline(yintercept = 0, col = "red") +
  theme_light() + 
  labs(title = "Weekly % Return vs Time", 
       x = "Time", 
       y = "% Return")

We can also see that markets go through periods of high volatility; market crashes (e.g., September 2008) are particularly prominent here.

Weekly$Direction <- factor(Weekly$Direction, levels = c("Down", "Up"))

Weekly_fits <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, 
                   data = Weekly, 
                   family = binomial)
summary(Weekly_fits)

Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
    Volume, family = binomial, data = Weekly)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  0.26686    0.08593   3.106   0.0019 **
Lag1        -0.04127    0.02641  -1.563   0.1181   
Lag2         0.05844    0.02686   2.175   0.0296 * 
Lag3        -0.01606    0.02666  -0.602   0.5469   
Lag4        -0.02779    0.02646  -1.050   0.2937   
Lag5        -0.01447    0.02638  -0.549   0.5833   
Volume      -0.02274    0.03690  -0.616   0.5377   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1496.2  on 1088  degrees of freedom
Residual deviance: 1486.4  on 1082  degrees of freedom
AIC: 1500.4

Number of Fisher Scoring iterations: 4

Based on the summary, Lag2 is the only statistically significant predictor (p = 0.0296).
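On the odds scale, the Lag2 coefficient is easier to interpret as an odds ratio. A minimal sketch using the estimate and standard error copied from the summary output above:

```r
# Odds ratio for Lag2, using the values printed in the summary above
beta <- 0.05844   # estimated coefficient for Lag2
se   <- 0.02686   # its standard error

exp(beta)                          # odds ratio: each 1-unit increase in Lag2
                                   # multiplies the odds of an Up week by ~1.06
exp(beta + c(-1, 1) * 1.96 * se)   # approximate 95% confidence interval
```

The interval excludes 1, consistent with Lag2 being the only significant predictor.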

weekly_probs <- predict(Weekly_fits, type = "response")
weekly_pred <- rep("Down", length(weekly_probs))
weekly_pred[weekly_probs > 0.5] <- "Up"
table(weekly_pred, Weekly$Direction)
           
weekly_pred Down  Up
       Down   54  48
       Up    430 557
(557+54)/1089
[1] 0.5610652
557/(557+48)
[1] 0.9206612
(54)/(54+430)
[1] 0.1115702

Based on the results of the confusion matrix, we correctly predicted the weekly trend 56.11% of the time. However, we correctly predicted the upward trend 92.07% of the time and the downward trend only 11.16% of the time.
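The three rates above (overall accuracy, sensitivity for Up, specificity for Down) can be computed in one place. A small helper function (illustrative, not from any package), applied to the counts in the confusion matrix above:

```r
# Summarize a 2x2 confusion matrix, treating "Up" as the positive class.
confusion_rates <- function(tp, tn, fp, fn) {
  c(accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

# Counts from the table above: 557 true Up, 54 true Down,
# 430 false Up, 48 false Down
confusion_rates(tp = 557, tn = 54, fp = 430, fn = 48)
```

This reproduces the 56.11%, 92.07%, and 11.16% figures in one call.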

train <- Weekly$Year < 2009

Weekly_2009 <- Weekly[!train, ]

Weekly_fits <- glm(Direction ~ Lag2, 
                   data = Weekly,
                   family = binomial, 
                   subset = train)

Weekly_prob <- predict(Weekly_fits, 
                       Weekly_2009, 
                       type = "response")

Weekly_pred <- rep("Down", length(Weekly_prob))

Weekly_pred[Weekly_prob > 0.5] <- "Up"

Direction_2009 <- Weekly$Direction[!train]

table(Weekly_pred, Direction_2009)
           Direction_2009
Weekly_pred Down Up
       Down    9  5
       Up     34 56
mean(Weekly_pred == Direction_2009)
[1] 0.625
56/(56+5)
[1] 0.9180328
9/(9+34)
[1] 0.2093023

Fitting a logistic regression model to the training data with Lag2 as the only predictor: the model accurately predicted the outcome 62.5% of the time. The model also correctly predicted the upward trend 91.80% of the time and the downward trend 20.93% of the time, which is a slight improvement over the previous model.

library(MASS)

Weeklylda_fit <- lda(Direction ~ Lag2, 
                     data = Weekly, 
                     subset = train)

Weeklylda_pred <- predict(Weeklylda_fit, Weekly_2009)

table(Weeklylda_pred$class, Direction_2009)
      Direction_2009
       Down Up
  Down    9  5
  Up     34 56
mean(Weeklylda_pred$class == Direction_2009)
[1] 0.625

The results of the LDA modelling were identical to the logistic regression model developed in part (d), with an accuracy of 62.5%.

Weeklyqda_fit <- qda(Direction ~ Lag2, data = Weekly, subset = train)

Weeklyqda_pred <- predict(Weeklyqda_fit, Weekly_2009)$class

table(Weeklyqda_pred, Direction_2009)
              Direction_2009
Weeklyqda_pred Down Up
          Down    0  0
          Up     43 61
mean(Weeklyqda_pred == Direction_2009)
[1] 0.5865385

Fitting a QDA model: the model predicts correctly 58.65% of the time, but it never predicts a down week.

library(class)

Week_train <- as.matrix(Weekly$Lag2[train])

Week_test <- as.matrix(Weekly$Lag2[!train])

train_Direction <- Weekly$Direction[train]

set.seed(1)

Weekknn_pred <- knn(Week_train, Week_test, train_Direction, k = 1)

table(Weekknn_pred, Direction_2009)
            Direction_2009
Weekknn_pred Down Up
        Down   21 30
        Up     22 31
mean(Weekknn_pred == Direction_2009)
[1] 0.5

Fitting a KNN model with K = 1 reduces the accuracy to 50%.

library(e1071)

weeklynb_fit <- naiveBayes(Direction ~ Lag2, data = Weekly, subset = train)

weeklynb_fit

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
     Down        Up 
0.4477157 0.5522843 

Conditional probabilities:
      Lag2
Y             [,1]     [,2]
  Down -0.03568254 2.199504
  Up    0.26036581 2.317485
weeklynb_class <- predict(weeklynb_fit, Weekly_2009)

table(weeklynb_class, Direction_2009)
              Direction_2009
weeklynb_class Down Up
          Down    0  0
          Up     43 61
mean(weeklynb_class == Direction_2009)
[1] 0.5865385

Fitting a naive Bayes model to the Weekly data produces exactly the same predictions as the QDA model from part (f). Both achieve 58.65% accuracy, still below the 62.5% of the logistic regression model.
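With a single Gaussian predictor, the naive Bayes posterior can be reproduced by hand from the a-priori probabilities and conditional means/standard deviations printed above. A sketch (parameter values copied from the fitted model's output):

```r
# Posterior P(Up | Lag2 = x) for the Gaussian naive Bayes fit above
post_up <- function(x) {
  up   <- 0.5522843 * dnorm(x, mean =  0.26036581, sd = 2.317485)
  down <- 0.4477157 * dnorm(x, mean = -0.03568254, sd = 2.199504)
  up / (up + down)
}

post_up(0)   # > 0.5: "Up" is predicted even for a flat week, which is
             # why the model never predicts "Down" on the test data
```

Because the two class distributions overlap heavily and the prior favors Up, the posterior for Up stays above 0.5 across the observed range of Lag2.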

  1. The logistic regression model (together with LDA, which makes identical predictions) provides the best results, correctly predicting the outcome in 62.5% of cases.

5 Bonus question: ISL Exercise 4.8.13 Part (j) (30pts)

set.seed(1)

Weekknn_pred2 <- knn(Week_train, Week_test, train_Direction, k = 20)

table(Weekknn_pred2, Direction_2009)
             Direction_2009
Weekknn_pred2 Down Up
         Down   21 21
         Up     22 40
mean(Weekknn_pred2 == Direction_2009)
[1] 0.5865385

Fitting a KNN model with K = 20 raises the accuracy to 58.65%, up from the 50% achieved by the K = 1 model in part (g).
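Rather than trying individual K values, one could scan a grid of K and compare test accuracies. To keep the sketch self-contained (no dataset required), here is a minimal hand-rolled one-dimensional k-nearest-neighbours classifier on simulated data; in practice `class::knn` would be applied to `Week_train` / `Week_test` as above:

```r
set.seed(1)

# Minimal 1-d k-NN majority vote (illustrative, not from the class package)
knn_predict <- function(train_x, train_y, test_x, k) {
  sapply(test_x, function(x) {
    nn <- order(abs(train_x - x))[seq_len(k)]   # indices of k nearest points
    names(which.max(table(train_y[nn])))        # majority class among them
  })
}

# Simulated data: two classes with shifted means
train_x <- c(rnorm(100, -1), rnorm(100, 1))
train_y <- rep(c("Down", "Up"), each = 100)
test_x  <- c(rnorm(50, -1), rnorm(50, 1))
test_y  <- rep(c("Down", "Up"), each = 50)

# Test accuracy across a grid of K values
sapply(c(1, 5, 20), function(k)
  mean(knn_predict(train_x, train_y, test_x, k) == test_y))
```

Small K tends to overfit the training data, so accuracy typically improves as K grows toward a moderate value, mirroring the jump from K = 1 to K = 20 seen here.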

Weeklyqda_fit2 <- qda(Direction ~ Lag2^2, data = Weekly, subset = train)

Weeklyqda_pred2 <- predict(Weeklyqda_fit2, Weekly_2009)$class

table(Weeklyqda_pred2, Direction_2009)
               Direction_2009
Weeklyqda_pred2 Down Up
           Down    0  0
           Up     43 61
mean(Weeklyqda_pred2 == Direction_2009)
[1] 0.5865385

The accuracy of the QDA model with Lag2^2 is 58.65%, identical to the QDA model from part (f). This is expected: in an R formula, ^ denotes crossing, so Lag2^2 simplifies to Lag2 and the model is unchanged; a genuine quadratic term would require I(Lag2^2) or poly(Lag2, 2).
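A quick check with model.matrix on toy data (the variable name is illustrative) shows that x^2 in a formula expands to the same design matrix as x alone, while I() protects the arithmetic and adds a true squared term:

```r
d <- data.frame(x = 1:5)

# In a formula, ^ denotes crossing, so x^2 collapses to x ...
colnames(model.matrix(~ x^2, d))
#> [1] "(Intercept)" "x"

# ... whereas I() protects the arithmetic and adds a quadratic column
colnames(model.matrix(~ I(x^2), d))
#> [1] "(Intercept)" "I(x^2)"
```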

Weeklylda_fit2 <- lda(Direction ~ Lag2:Lag3, 
                      data = Weekly,
                      subset = train)

Weeklylda_pred2 <- predict(Weeklylda_fit2, Weekly_2009)

table(Weeklylda_pred2$class, Direction_2009)
      Direction_2009
       Down Up
  Down    0  0
  Up     43 61
mean(Weeklylda_pred2$class == Direction_2009)
[1] 0.5865385

The accuracy of the LDA model with the Lag2:Lag3 interaction as predictor is 58.65%, lower than the 62.5% of the LDA model from part (e).

6 Bonus question: ISL Exercise 4.8.4 (30pts)